I. Introduction

Air Transport has become more and more important in many industries. It plays an important role not only in our personal life but also in the global economic activities. As data collected by statista from IATA and ICAO, the number of total flights has been steadily increasing in the recently 10 years. https://www.statista.com/statistics/564769/airline-industry-number-of-flights/

library(plotly)
flights<-data.frame(year=c("2008","2009","2010","2011","2012","2013","2014","2015","2016",
                           "2017","2018"), flights=c(26.5,25.9,27.8,30.1,31.2,32,33,
                                                     34,35.4,36.8,38.1))
library(ggplot2)
p <- ggplot(flights,aes(x=year,y=flights))+
  geom_col(fill="lightblue", col="grey")+
  ylab("Number of Flights in Millions")+
  xlab("Year")+
  ggtitle("Number of Flights By Year")

ggplotly(p)

However, unfortunately, aviation accidents take place every year. Hence, we want to perform research on plane accidents.In this report, we will discuss the trend of accidents by year, the relationship between plane accidents and the phase of flight, the relationship between the plane accidents and the weather condition and the relationship between accidents and the air transport type.

II. Description of the data source

Our dataset comes from the NTSB accident database, which contains information from 1982 and later about civil accidents and some selected incidents. There is a report about the accident and its probable cause on the website. Although there are many aviation accidents database, the database of NTSB is more detailed and easy to download. The question we are concerned about can be solved by this database. Meanwhile, it has a corresponding report, so we can look at the report to find the information we need. There are more than 80,000 observations in this database, which contains the information from 1982 and 2019. In this database, it has variables about the date, the place and the casualties of the accidents. It also contains information about the plane, the engine and the air carrier, which can help us analyze the relationship between accidents and different variables. As technology has improved a lot in recent years, we think the data years ago are not valuable for us to draw conclusions to what we want to research. In addition, the dataset of 2019 is not complete, so it may provide wrong information, especially when we want to find the trend by year. Hence, we only use the data from 2008 to 2018 to analyze the plane accidents. https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx

The detailed information about each variable in the dataset can be found in the link below. https://www.ntsb.gov/_layouts/ntsb.aviation/AviationDownloadDataDictionary.aspx

For people who are interested in the past air accident, we created a 10 worst aviation accidents bar chart(defined by the total Fatalities), you can click on the airplane mode(under the airplane picture) to see more detail about the accident at ASN (aviation safety network) website.

10 worst aviation accidents

III. Description of data import / cleaning / transformation

The website provide XML format and txt format dataset for downloading, so we downloaded the XML dataset and use the XML package to deal with the dataset. We first transform the dataset to a list and then transform the list to a dataframe. Then we can work with the dataset in R.

library(XML)
dataset.1<-xmlToDataFrame("AviationData.xml")
avi<-xmlParse("AviationData.xml")
list.avi<-xmlToList(avi)
avi.d<-t(data.frame(list.avi,stringsAsFactors = F))
avi.d<-data.frame(avi.d, stringsAsFactors = F)
rownames(avi.d)<-as.character(1:dim(avi.d)[1])

As mentioned above, we only use the data from 2008 to 2018, so we extract the year of the Event Date and add a new variable ‘year’ to help us select the data we want to use.

avi.d$year<-regmatches(avi.d$EventDate,regexpr("[0-9]{4}",avi.d$EventDate))
avi.d.r<-avi.d[avi.d$year >= "2008"& avi.d$year!="2019",]

Then after observing our dataset, we find that the Column ‘InjurySeverity’ contains information about the causalties. In addition, it only contains 160 unavailable observations. If the observations in Column ‘InjurySeverity’ are unavailable, we use NA to fill in the blank and we use 0 to fill in the blank otherwise for Column ‘TotalFatalInjuries’, ‘TotalSeriousInjuries’, ‘TotalMinorInjuries’, ‘TotalUninjured’.

## Fill in the number about Injuries Using InjurySeverity
fill.in<-function(x)
{
  out<-ifelse(x=="",0,x)
  return(out)
}  
avi.d.r<-data.frame(avi.d.r)
avi.d.r[avi.d.r$InjurySeverity=="Unavailable",]$TotalFatalInjuries<-NA
avi.d.r[avi.d.r$InjurySeverity=="Unavailable",]$TotalSeriousInjuries<-NA
avi.d.r[avi.d.r$InjurySeverity=="Unavailable",]$TotalMinorInjuries<-NA
avi.d.r[avi.d.r$InjurySeverity=="Unavailable",]$TotalUninjured<-NA
avi.d.r[,c("TotalFatalInjuries","TotalSeriousInjuries","TotalMinorInjuries","TotalUninjured")]<-apply(avi.d.r[,c("TotalFatalInjuries","TotalSeriousInjuries","TotalMinorInjuries","TotalUninjured")],2,fill.in)

Since the accidents in the dataset are all inclusive, including all severities in terms of loss of life and damage to aircraft, we want to divide all the accidents into 3 category according to its severity. If anyone lost his life or injured seriously in an accident, we assigned it to the ‘Fatal or Serious Injuries’ category. Then if only someone injured minorly in an accident, we assigned it to the ‘Minor Injuries’ category. We assigned the other accidents into ‘No Injuries’ category.

avi.d.r$AccidentCategory<-
  ifelse(avi.d.r$TotalMinorInjuries!=0&avi.d.r$TotalFatalInjuries==0
         &avi.d.r$TotalSeriousInjuries==0,"Minor Injuries",
         ifelse(avi.d.r$TotalFatalInjuries!=0|avi.d.r$TotalSeriousInjuries!=0,
                "Fatal or Serious Injuries", "No Injuries"))

IV. Analysis of missing values

As mentioned above, we use the Column ‘InjurySeverity’ to fill in the blank relevant to causaltites. Then we convert the other blanks to NA and check the missing rate of all variables in the dataset.

## Convert blank to NA
blank.to.na<-function(x)
{
  out<-ifelse(x=="",NA,x)
  return(out)
}
avi.d.r<-apply(avi.d.r,2,blank.to.na)
## Count the missing data in each Column
mis<-function(x)
{
  out<-sum(is.na(x))
  return(out)
}
mis.num<-apply(avi.d.r,2,mis)/dim(avi.d.r)[1]
mis.num
##              EventId    InvestigationType       AccidentNumber 
##         0.0000000000         0.0001063264         0.0000000000 
##            EventDate             Location              Country 
##         0.0000000000         0.0012227539         0.0000000000 
##             Latitude            Longitude          AirportCode 
##         0.1037745880         0.1037214248         0.3565656566 
##          AirportName       InjurySeverity       AircraftDamage 
##         0.2066985646         0.0000000000         0.0509303562 
##     AircraftCategory   RegistrationNumber                 Make 
##         0.0297182350         0.1237107921         0.0026581606 
##                Model         AmateurBuilt      NumberOfEngines 
##         0.0032961191         0.0281765019         0.1099946837 
##           EngineType       FARDescription             Schedule 
##         0.1309941520         0.0478468900         0.9029771398 
##      PurposeOfFlight           AirCarrier   TotalFatalInjuries 
##         0.1514619883         0.9475810739         0.0085061138 
## TotalSeriousInjuries   TotalMinorInjuries       TotalUninjured 
##         0.0085061138         0.0085061138         0.0085061138 
##     WeatherCondition   BroadPhaseOfFlight         ReportStatus 
##         0.1171185540         0.2274322169         0.0000000000 
##      PublicationDate                 year     AccidentCategory 
##         0.0690058480         0.0000000000         0.0085061138

There are some variables such as ‘AirportCode’, ‘AirportName’, ‘Schedule’, whose missing rate is very high and they are not relevant to our research, so we should delete these variables.

avi.d.r<-avi.d.r[,-9]
avi.d.r<-avi.d.r[,-9]
avi.d.r<-avi.d.r[,-19]
n<-dim(avi.d.r)[1]

And as we just want to find the general relationship between the plane accidents and the other variables, we just select those observations which contain the variables we are interested in to find the general relationship. As there are 18810 observations in our dataset, a general trend can be found even if there are some missing values.

V. Results

Accidents By Year

First, we count the number of accidents of different years and check if there is any trend of the number of accidents by year. We also include a 3-year-moving-average line to help us find the trend of the number of accidents.

library(dplyr)
library(ggplot2)
avi.d.r<-data.frame(avi.d.r)
accidents_by_year<-avi.d.r %>%
         group_by(year) %>%
         count(year)
n.acc.year<-dim(accidents_by_year)[1]
moving.acc<-rep(0,7)
for(i in 1:9)
{
  moving.acc[i]<-mean(accidents_by_year$n[i:i+2])
}
moving.data<-data.frame(year=accidents_by_year$year[3:n.acc.year],three_year_average=moving.acc)
label<-c("accidents","3-year-moving-average")
cols.acc.y <- c("3-year-average"="red","accidents"="lightblue")
p <- ggplot()+
  geom_bar(data = data.frame(avi.d.r),aes(x = year,fill="accidents"),col='grey')+
  geom_point(data = moving.data, aes(x = year, y = three_year_average))+
  geom_line(data = moving.data, aes(x = year, y = three_year_average,group=1,col='3-year-average'))+
  scale_color_manual(name = "Line", values = cols.acc.y)+
  ggtitle(label = "Accidents By Year")+
  ylab("Number of Accidents")+
  xlab("Year")+
  scale_fill_manual(name="Bar",values=cols.acc.y)
ggplotly(p)

We can see that the number of aviation accidents are about 1500 after 2013. As it is mentioned above that the number of flights steadily increases. Even we cannot combine them together and calculate the accident rate per year because our data of total flights number and the data of accident is coming from different sources. We should still be able to conclude that the accident rate in fact decreases.

Deaths By Year

There is another problem people are curious about is the total fatalities every year. The fatal accidents are worth attention.

## Death By year
death_by_year<-avi.d.r %>%
  filter(is.na(TotalFatalInjuries)==F) %>%
  group_by(year) %>%
  summarise(sum(as.numeric(as.character(TotalFatalInjuries))))
colnames(death_by_year)<-c("year","death")
p <- ggplot(death_by_year, aes(x=year,y=death))+
  geom_point(col="blue")+
  geom_line(group=1,col="blue")+
  ggtitle("Deaths By Year")+
  xlab("Year")+
  ylab("Death")

ggplotly(p)

We can see that the number of deaths decreases with fluctuation. However, there is above 800 people died in aviation accidents in 2018, which is still a surprisingly large number. Hence, we should pay attention to the air transport safety.

Accident Category By Year

We want to see if the composition of accidents are different year by year, i.e, if the proportion of fatal accidents are the same year by year, so we use a doubledecker plot to check the accident category.

## Accident Category By Year
library("tidyverse")
cate_by_year<-avi.d.r %>%
  filter(is.na(AccidentCategory)==F) %>%
  group_by(year) %>%
  count(AccidentCategory)
library("ggmosaic")
p <- ggplot(cate_by_year)+
  ggmosaic::geom_mosaic(aes( weight = n, x = ggmosaic::product(year), fill = AccidentCategory))+
  ylab("Number of Accidents")+
  xlab("Accident Category")
ggplotly(p)

From the graph, we can find that the proportion of accident category is almost the same every year. It seems that accident that takes no injuries is about more than half of the total accident. However, there are more fatal or serious injuries than minor injuries. The reason might be once the accident is a serious one, it will more likely to be fatal. We can analyze a specific year to see the percentage.

## Accident Category In 2018
cate_by_year_2018<-avi.d.r %>%
  filter(is.na(AccidentCategory)==F) %>%
  filter(year=='2018') %>%
  group_by(AccidentCategory) %>%
  count(AccidentCategory)
cate_by_year_2018 <- cate_by_year_2018 %>%  
  summarize(count = sum(n)) %>%
  mutate(percent = count/sum(count))
p <- ggplot(cate_by_year_2018, aes(x = AccidentCategory, y = round(percent,2) , fill = AccidentCategory)) +
  geom_col()+
  ggtitle("Accident Category In 2018") +
  ylab("Number of Accidents")+
  xlab("Year")

ggplotly(p)

From the graph, the proportion of accident category is shown clearly. The ‘No Injuries’ Accident takes the largest proportion, the ‘Fatal or Serious Injuries’ Accident is in the middle and the ‘No Injuries’ Accident is the least. The result is in line with our common sense.

Accidents By Phase

Then we want to find the relationship between accidents and phase. We are curious about if there is any stage that the accidents are most likely to take place. We use a bar plot to find the relationship.

## Accidents By Phase
library(tidyverse)
accidents_by_phase<-avi.d.r %>%
  group_by(BroadPhaseOfFlight) %>%
  count(BroadPhaseOfFlight)
n.acc.p<-sum(accidents_by_phase$n)
accidents_by_phase<-accidents_by_phase %>%
  filter(BroadPhaseOfFlight!=""& BroadPhaseOfFlight !="UNKNOWN") %>%
  mutate(rate=(n/n.acc.p))
accidents_by_phase$rate <- round(accidents_by_phase$rate,2)
p <- ggplot(data = accidents_by_phase, 
       aes(x= fct_reorder(BroadPhaseOfFlight,rate), rate ))+
  geom_col(fill="lightblue",col="grey")+
  coord_flip()+
  ylab("Broad Phase of Flight")+
  xlab("Number of Accidents")+
  ggtitle(label = "Accidents By Broad Phase of Flight")

ggplotly(p)

From the graph, we can see that landing and taking off is the top two phase when the accidents take place, which is in line with our common sense. All the accidents take place during these two phases account for approximately 60 percent of all the accidents. Hence, the landing and taking off phase play an important role in aviation safety.

Accident Category By Phase

Then, let’s do research on if the phase of flight has relationship with the severity of accidents. Since the first five phase takes more than 80 percent, we only do reasearch on that five phases.

levels<-c("LANDING","TAKEOFF","MANEUVERING","APPROACH","CRUISE")
cate_by_phase<-avi.d.r %>%
  filter(is.na(BroadPhaseOfFlight)==F) %>%
  filter(is.na(AccidentCategory)==F) %>%
  filter(BroadPhaseOfFlight!=""& BroadPhaseOfFlight !="UNKNOWN") %>%
  filter(BroadPhaseOfFlight=="LANDING"| BroadPhaseOfFlight =="TAKEOFF"|
          BroadPhaseOfFlight =="MANEUVERING"|BroadPhaseOfFlight =="APPROACH"|
           BroadPhaseOfFlight =="CRUISE") %>%
  select(BroadPhaseOfFlight,AccidentCategory)%>%
  mutate(Broad_Phase_new=as.vector(BroadPhaseOfFlight))
library(ggmosaic)

p <- ggplot(data = cate_by_phase)+
  geom_mosaic(aes(x=product(AccidentCategory, Broad_Phase_new ),
                  fill = AccidentCategory),
              divider = c("hspine" , "vspine"))+
  xlab("Accident Category")+
  ylab("Broad Phase") +
  ggtitle("Accident Category By Phase")
ggplotly(p)

From the graph, we can find the proportion of different accident categories is different in different broad phase. Although accidents are more likely to happen in landing and take off phase, the ‘No Injuries’ accidents are the most in these two phases while the ‘Fatal or Serious Injuries’ accidents are the most in the other three phases. Maybe in the other phases except the landing and take off, it is more difficult to control the severity level of the accidents.

Accidents By Weather Condition

Next, we discuss whether the number of accidents are related to the weather condition. First, let’s give some explanations of the symbol of weather condition in this dataset. The symbol ‘UNK’ means unknown and should be deleted when we analyze the data. The symbol ‘VMC’ means the weather and the vision are good while ‘IMC’ means the weather or the vision is bad.

## Accidents by Weather Condition
accidents_by_weather<-avi.d.r %>%
  filter(is.na(WeatherCondition)==F) %>%
  filter(WeatherCondition!='UNK') %>%
  group_by(WeatherCondition) %>%
  count(WeatherCondition) 
p <- ggplot(data = accidents_by_weather, 
       aes(x= fct_reorder(WeatherCondition,-n), n ))+
  geom_col(fill="lightblue",col="grey")+
  xlab("Weather Condition")+
  ylab("Number of Accidents")+
  ggtitle(label = "Accidents By Weather Condition")

ggplotly(p)

The number of accidents under ‘VMC’ condition is far more than that under ‘IMC’ condition, which is likely to cause by the number of flights under ‘VMC’ condition is far more than that under ‘IMC’ condition.

Accident Category By Weather Condition

Then we still want to do research on whether the serious accident are more likely to take place under some weather condiitons.

## Accidents Category by Weather Condition
cate_by_weather<-avi.d.r %>%
  filter(is.na(WeatherCondition)==F) %>%
  filter(is.na(AccidentCategory)==F) %>%
  filter(WeatherCondition!='UNK') %>%
  select(WeatherCondition,AccidentCategory)%>%
  mutate(w_con_new=as.vector(WeatherCondition))
library(ggmosaic)
p <- ggplot(data = cate_by_weather)+
  geom_mosaic(aes(x=product(AccidentCategory, w_con_new),
                  fill = AccidentCategory),
              divider = c("hspine" , "vspine"))+
  xlab("Accident Category")+ylab("Weather Condition")+
  ggtitle("Accident Category by Weather Condition")

ggplotly(p)

We can see that even though most accidents take place under good weather condition, ‘Fatal or Serious Injuries’ accident takes the largest proportion under ‘IMC’ while ‘No Injuries’ takes the largest proportion under ‘VMC’ condition, which is in line with our common sense. Hence, in bad weather, the severe accidentes are very likely to take place and we must pay more attention to that.

Death Rate By Weather

library(GGally)
## Death and Accidents By Weather
library(GGally)
## Death and Accidents By Weather
death_by_weather<-avi.d.r %>%
  filter(is.na(TotalFatalInjuries)==F) %>%
  filter(is.na(WeatherCondition)==F) %>%
  filter(WeatherCondition!='UNK') %>%
  group_by(WeatherCondition) %>%
  summarise(sum(as.numeric(TotalFatalInjuries)))
colnames(death_by_weather)<-c("WeatherCondition","death")
death_by_weather<-inner_join(death_by_weather,accidents_by_weather)
death_by_weather<- death_by_weather %>%
  mutate(death_rate=death/n)
p1<-ggplot(data = accidents_by_weather, 
       aes(x= fct_reorder(WeatherCondition,n), n, fill = WeatherCondition))+
  geom_col(col="grey")+
  xlab("Weather Condition")+
  ylab("Number of Accidents")+
  ggtitle(label = "Accidents By Weather Condition")+
  theme(legend.position = "none")
p2<-ggplot(death_by_weather, aes(x=WeatherCondition,y=round(death_rate,2),fill = WeatherCondition))+
  geom_col(col="grey")+
  ylab("Death Rate")+
  ggtitle("Deaths By Year")+theme(legend.position = "none")+
  xlab("Weather Condition")
library(gridExtra)
subplot(ggplotly(p1), ggplotly(p2), margin = 0.04)

From the graph, we can see clearly that even though the number of accidents under ‘VMC’ is far more than that under ‘IMC’, the death rate under ‘IMC’ is much higher than that under ‘VMC’. Namely, the death rate in bad weather condition is much higher, so more attention should be paid when the weather is not good. Flights had better not be arranged when the weather condition is bad. (Death rate is equal to the total Fatalities divided by total number of accident within each weather condition)

Accidents By Type of Air Transport

Finally, we want to check the relationships between the number of accidents with the type of air transport. We use the variable FARDescription instead of Purpose of Flight to find the relationship because FARDescription contains information about the regulations that this plane must obey as well as the purpose of the plane, so it may be related to the accidents. For example, ‘Part 91:General Aviation’ represents small non-commercial aircraft within United States and the oprator must obey the regulations within part 91. The categories in FARDescription are almost clear, and I just want to clarify some categories. Part 135 represents non-scheduled commercial air service while Part 121 represents scheduled air service. And Part 133 anthorize the operator to utilize helicopters to carry external loads. ‘Part 91 Subpart K: Fractional’ represents the fractional aircraft, which permitted shared ownership of an aircraft. Part 125 represents air plane which can only seat 20 to 6000 passenagers.

## Accidents By Type of Air Transport
accidents_by_purpose<-avi.d.r %>%
  filter(is.na(FARDescription)==F) %>%
  filter(FARDescription!='Unknown') %>%
  group_by(FARDescription) %>%
  count(FARDescription)
p <- ggplot(data = accidents_by_purpose, 
       aes(x= fct_reorder(FARDescription,n), n ))+
  geom_col(fill="lightblue",col="grey")+
  xlab("Type of Air Transport")+
  ylab("Number of Accidents")+
  ggtitle(label = "Accidents By Type of Air Transport")+
  coord_flip()
ggplotly(p)

We can see that most of the aircrafts which suffered accidents arenon-commercial aircraft within United States. One possible reason is that such kind of air transport takes the largest proportion of all the flights and meanwhile, the regulations of this kind of flights are less strict are another possible reason.

VI. Interactive component

In the plot below, it shows where did all the fatal accidents happened Since for some accidents, the location information(Latitude and Longitude) is missing. It will not be shown on the map. The radius of the circle is determined by the fatalities, and you can click on the circle to see the relative information.

We found the top three accident(defined by fatal number), the first two doesn’t have any location information. We manually searched those flight on ASN.

The worst case was happened in Thursday 17 July 2014, and destroyed at Hrabove.

The second worst case was happened in Saturday 8 March 2014, and destroyed at within Indian Ocean.

We can see from the plot, there are lot of fatal accident within US. However, as we mentioned and explained above, the part 91 have the max number of accident, therefore it is not surprisingly that there are more circles lands on US.

library(leaflet)

df <- as.data.frame(avi.d.r)
df$Latitude <- as.numeric(as.character(df$Latitude))
df$Longitude <- as.numeric(as.character(df$Longitude))
df$TotalFatalInjuries <- as.numeric(as.character(df$TotalFatalInjuries))
df_expect_ZERO <- filter(df, !is.na(TotalFatalInjuries))
df_expect_ZERO <- filter(df_expect_ZERO, TotalFatalInjuries!=0)
#check the worst three accident
Top_Three <- sort(df_expect_ZERO$TotalFatalInjuries, decreasing = T)[1:3]
#df_expect_ZERO[df_expect_ZERO$TotalFatalInjuries=="295",]$EventDate
#df_expect_ZERO[df_expect_ZERO$TotalFatalInjuries=="239",]$EventDate
#df_expect_ZERO[df_expect_ZERO$TotalFatalInjuries=="228",]$EventDate

df_expect_ZERO <- filter(df_expect_ZERO, !is.na(Longitude))

content <-  c(NA)
i <- 0
for (i in 1:length(df_expect_ZERO$EventDate)){
content[i] <- base::paste(sep = " <br/>death number:",
  df_expect_ZERO$EventDate[i],
  df_expect_ZERO$TotalFatalInjuries[i]
)
}

df_expect_ZERO$content <- as.character(content)

#find the max fatal case
bins <- c(0, 20, 100, 250)
pal <- colorBin("blue",domain = df_expect_ZERO$TotalFatalInjuries, bins = bins)
pal2 <- colorBin("Blues", domain = df_expect_ZERO$TotalFatalInjuries, bins = bins)

# Show a CUSTOM circle at each position. Size in meters --> change when you zoom.
m=leaflet(data = df_expect_ZERO) %>% addTiles() %>% 
addCircles(~Longitude, ~Latitude, 
radius=~TotalFatalInjuries*500 , 
stroke = T, 
color =  ~pal2(df_expect_ZERO$TotalFatalInjuries),
opacity = 0.1,
fillOpacity = 2,
fillColor = ~pal(df_expect_ZERO$TotalFatalInjuries),

popup = ~as.character(content)
) %>% 
setView( lng = 0, lat = 0, zoom = 2)
m

VII. Conclusion

In summary, we find that the accident rate actually decreases with time. But although death toll decreases with fluctuation, a lot of people still die in aviation accidents every year, so air transport safety is worth our attention.

Next, the ‘No Injuries’ Accident takes the largest proportion of all accidents every year while the ‘Fatal or Serious Injuries’ Accident also takes large proportion. Most of the accidents take place during the taking off and landing phase, but the serious accidents take the largest proportion in accidents which do not take place in these two phases, which is a problem we should pay attention to.

Then although most accidents take place in good weather because most flights take place in a good weather, serious accidents are very likely to happen in a bad weather condition, which means more attention must be paid in a bad weather condition.

Finally, non-commercial aircraft within United States is most vulnerable to accidents, which may be caused by the large quantity and relatively loose regulations.

In this report, we perform research on an inclusive dataset and we just find some general relationships between the number of accidents and some other variables. In the future, we can do research on a specific type of plane. I think many people are curious about the plane accidents of commercial planes.